Homework 2
Introduction
In this study, the quarterly gasoline and diesel sales (in 1000 m3) of a major distributor between 2000 and 2006, and a number of potential factors dataset is given as the subject. Our initial aim is to inspect the relationships between these factors and sales, in addition to the real life determinants like seasonal effects and trends. After the analysis of the given variables, a model will be tried to fit to come up with predictions of the year 2007.
Visualization and Numerical Analysis
The first step to our research is manipulation of the dataset to fit the proper format for the time series regression modeling and have a general understanding of the variables by numerical analysis.
The variables provided as an input are given below: UGS: Unleaded gasoline sale in a given quarter, RNUV: An index indicating the rate of new unleaded gasoline using vehicles being added to the traffic in a quarter, PU: Average price (adjusted with an index) of a liter of unleaded gasoline in a quarter, PG: Average price (adjusted with an index) of a liter of diesel gasoline in a quarter, NUGV: Number of unleaded gasoline using vehicles in the traffic, NDGV: Number of diesel gasoline using vehicles in the traffic (per 1000 people), GNPA: Agriculture component of Gross National Product (adjusted with an index), GNPC: Commerce component of Gross National Product (adjusted with an index), GNP: Grand total for GNP (agriculture, commerce and other components total).
salesData <- read.csv("/Users/larahos/Desktop/HW2_data.csv")
colnames(salesData) <- c("Quarter","UGS","RNUV","NLPG","PU","PG","NUGV","NDGV","GNPA","GNPC","GNP")
for(i in c(2,4,7,9,10,11)) {
salesData[,i] <- gsub( " ", "", salesData[,i])
salesData[,i]=as.numeric(salesData[,i])
}
salesData[,"Quarter"]<-as.yearqtr(salesData[,"Quarter"],format="%Y _Q %q")
sales_data <- data.table(salesData)
salesDataForecast <- salesData[c(29:32),]## Quarter UGS RNUV NLPG PU PG NUGV NDGV GNPA GNPC
## 1: 2000 Q1 1128971 0.0146 940000 469.03 355.69 4647500 281.9853 1040173 3483132
## 2: 2000 Q2 1199569 0.0205 941000 459.42 344.58 4742876 284.0813 1760460 4525451
## 3: 2000 Q3 1370167 0.0207 943500 439.98 327.21 4840931 286.7169 6974808 5915204
## 4: 2000 Q4 1127548 0.0163 948000 402.08 300.67 4919685 288.3137 3267125 4929778
## 5: 2001 Q1 1033918 0.0071 950000 411.58 305.75 4954754 287.6237 1004528 3418387
## 6: 2001 Q2 1019754 0.0051 955000 520.39 374.78 4980204 287.8814 1449357 4359831
## GNP
## 1: 18022686
## 2: 21797130
## 3: 30050207
## 4: 24480153
## 5: 15832648
## 6: 20296918
## Quarter UGS RNUV NLPG
## Min. :2000 Min. : 736580 Min. :0.001200 Min. : 940000
## 1st Qu.:2002 1st Qu.: 876210 1st Qu.:0.004100 1st Qu.: 997500
## Median :2004 Median : 983573 Median :0.007250 Median :1352000
## Mean :2004 Mean : 987212 Mean :0.008906 Mean :1299922
## 3rd Qu.:2006 3rd Qu.:1097324 3rd Qu.:0.013150 3rd Qu.:1602000
## Max. :2008 Max. :1370167 Max. :0.020700 Max. :1797400
## NA's :4
## PU PG NUGV NDGV
## Min. :402.1 Min. :300.7 Min. :4647500 Min. :282.0
## 1st Qu.:479.2 1st Qu.:366.9 1st Qu.:5029281 1st Qu.:286.6
## Median :526.7 Median :401.6 Median :5186068 Median :291.9
## Mean :519.2 Mean :400.4 Mean :5296758 Mean :304.9
## 3rd Qu.:565.2 3rd Qu.:449.2 3rd Qu.:5604709 3rd Qu.:322.9
## Max. :609.0 Max. :471.9 Max. :6065597 Max. :357.3
##
## GNPA GNPC GNP
## Min. : 832953 Min. :3374849 Min. :15832648
## 1st Qu.:1348066 1st Qu.:4403081 1st Qu.:21790148
## Median :1850657 Median :5124300 Median :25229712
## Mean :2895561 Mean :5246534 Mean :25745763
## 3rd Qu.:3922950 3rd Qu.:5967433 3rd Qu.:29843404
## Max. :7140722 Max. :7480414 Max. :36741745
##
## Classes 'data.table' and 'data.frame': 32 obs. of 11 variables:
## $ Quarter: 'yearqtr' num 2000 Q1 2000 Q2 2000 Q3 2000 Q4 ...
## $ UGS : num 1128971 1199569 1370167 1127548 1033918 ...
## $ RNUV : num 0.0146 0.0205 0.0207 0.0163 0.0071 0.0051 0.0041 0.0048 0.0012 0.0032 ...
## $ NLPG : num 940000 941000 943500 948000 950000 ...
## $ PU : num 469 459 440 402 412 ...
## $ PG : num 356 345 327 301 306 ...
## $ NUGV : num 4647500 4742876 4840931 4919685 4954754 ...
## $ NDGV : num 282 284 287 288 288 ...
## $ GNPA : num 1040173 1760460 6974808 3267125 1004528 ...
## $ GNPC : num 3483132 4525451 5915204 4929778 3418387 ...
## $ GNP : num 18022686 21797130 30050207 24480153 15832648 ...
## - attr(*, ".internal.selfref")=<externalptr>
Data Visualization
The graph and the smoothing lines are drawn below to visually inspect the dataset.
ggplot(sales_data, aes(x=Quarter, y=UGS)) +
geom_line(color="turquoise4") +
theme_minimal() +
labs(x="", y="Sales", title="Unleaded Gasoline Sales per Quarter 2000-2006") +
theme(plot.title = element_text(hjust=0.5, size=20, face="bold"))+
geom_smooth(formula = y ~ x, method = "lm") The dataset shows a decreasing trend over time as seen in the graph above from 2000 to 2007. Furthermore, from the line it can be concluded that the variance is getting relatively smaller around 2003 and gets larger afterwards. To have a better understanding, the datatset is decomposed below:
datats <- ts(sales_data$UGS,freq=4,start=c(2000,1))
decom_data <- decompose(datats, "additive")
plot(decom_data) When the time series is decomposed, it is seen that there is a significant decreasing trend over time. Also the seasonality effect is seen as having higher values of sales in Q3 each year. The dataset is not stationary. The aim of this project is basically after including seasonal and trend variables, explaining the random residuals in terms of other variables.
Plotting Autocorrelation
acf(sales_data$UGS[1:28])As illustrated above, there is a significant lag at lag 1. This lag value is a sign of trend, each year being affected by the past years sales. There is a significant lag at lag 4 and relatively high lag at 8, which proves the fact of seasonal effects.
Model Building for Time Series Regression
As a conclusion of the visual and acf analysis of the dataset, trend and seasonality information is being added to the dataset, in order to use in following steps for model fitting and prediction.
#added trend information
sales_data[,trend:=1:.N]
#add quarter information to use in seasonality afterwards
Q=seq(1,4,by=1)
sales_data=cbind(sales_data,Q)Trend and seasonality approaches to all dataset as a whole, but also lagged values can be a godd guide to catch autocorrelation and dependency betweeen UGS value. Therefore, we added lagged values Y(t-1) and Y(t-4) to the dataset.
#add lagged values of lag1 and lag4
sales_data$lag1 <- NA
sales_data$lag4 <- NA
sales_data$lag1 <- dplyr::lag(sales_data$UGS)
sales_data$lag4 <- dplyr::lag(sales_data$UGS, 4L)Analysis of the Independent Variables
To understand which variables have a significant correlation between, it is logical to print the output by using ggpairs.
ggpairs(sales_data[,-1]) There is a significant correlation between UGS and NLPG, NUGV, GNPA, NDGV; also trend and lag 4 values as expected. PU and PG values are highly correlated with each other. GNPA, GNPC and GNP values are also have a high correlation. The main goal is to include variables that effects UGS the most but have a low correlation internally.
Model Fitting
The model will be built step by step, checking significance of variables in each step and adding them in a logical order. Our main goal is to increase adjusted R-squared value while satisfying the assumption of random, independent errors with mean 0 and constant variance.
model1 <- lm(sales_data$UGS~trend, data = sales_data)
summary(model1)##
## Call:
## lm(formula = sales_data$UGS ~ trend, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -199945 -73550 -21904 71224 237369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1170777 43292 27.043 < 2e-16 ***
## trend -12660 2608 -4.854 4.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111500 on 26 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.4754, Adjusted R-squared: 0.4552
## F-statistic: 23.56 on 1 and 26 DF, p-value: 4.945e-05
checkresiduals(model1)##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals
## LM test = 19.942, df = 6, p-value = 0.002836
The initial model has started with the obvious trend, it has a low R squared value and also autocorrelation is significantly high. To overcome the lag effect, two methods will be tried; lag 4 and seasonality. The one with the higher R-squared value will be chosen to move on for the further models.
model2 <- lm(sales_data$UGS~trend+lag4, data = sales_data)
summary(model2)##
## Call:
## lm(formula = sales_data$UGS ~ trend + lag4, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66549 -42583 -989 47813 81027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.371e+04 1.178e+05 0.710 0.485
## trend 2.084e+03 1.960e+03 1.063 0.300
## lag4 8.254e-01 9.262e-02 8.912 1.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49740 on 21 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8525, Adjusted R-squared: 0.8385
## F-statistic: 60.69 on 2 and 21 DF, p-value: 1.87e-09
checkresiduals(model2)##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals
## LM test = 2.5252, df = 6, p-value = 0.8656
model3 <- lm(sales_data$UGS~trend+as.factor(Q), data = sales_data)
summary(model3)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q), data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81167 -31283 -3458 28640 94502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1060372 23653 44.830 < 2e-16 ***
## trend -13497 1147 -11.764 3.28e-11 ***
## as.factor(Q)2 121532 25987 4.677 0.000104 ***
## as.factor(Q)3 273618 26063 10.498 3.03e-10 ***
## as.factor(Q)4 95049 26189 3.629 0.001405 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48570 on 23 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9119, Adjusted R-squared: 0.8966
## F-statistic: 59.53 on 4 and 23 DF, p-value: 8.446e-12
checkresiduals(model3)##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals
## LM test = 11.462, df = 8, p-value = 0.1769
Even though lag 4 value has dealt with autocorrelation better, it reduced the effect of trend and has a lower R squared value. Therefore factorized seasonal values are chosen to move forward.
model4 <- lm(sales_data$UGS~trend+as.factor(Q)+ NLPG+PU+PG+NUGV+NDGV+GNPA , data = sales_data)
summary(model4)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NLPG + PU +
## PG + NUGV + NDGV + GNPA, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67794 -10461 1987 18461 32208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.094e+06 8.283e+05 3.735 0.00165 **
## trend 1.514e+04 1.284e+04 1.178 0.25486
## as.factor(Q)2 1.463e+05 2.356e+04 6.212 9.48e-06 ***
## as.factor(Q)3 4.953e+05 1.597e+05 3.102 0.00648 **
## as.factor(Q)4 1.607e+05 4.385e+04 3.664 0.00192 **
## NLPG -2.937e-01 1.804e-01 -1.628 0.12195
## PU 2.276e+01 1.017e+03 0.022 0.98241
## PG -1.129e+03 1.437e+03 -0.786 0.44295
## NUGV -1.023e+00 3.343e-01 -3.061 0.00708 **
## NDGV 1.238e+04 3.581e+03 3.456 0.00302 **
## GNPA -3.613e-02 2.764e-02 -1.307 0.20854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29700 on 17 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9757, Adjusted R-squared: 0.9613
## F-statistic: 68.12 on 10 and 17 DF, p-value: 1.055e-11
checkresiduals(model4)##
## Breusch-Godfrey test for serial correlation of order up to 14
##
## data: Residuals
## LM test = 26.623, df = 14, p-value = 0.02154
The variables which showed a high correlation with UGS values are included in the model. As seen in the summary, GNPA, PU and PG do not have a significant effect, therefore they will be eliminated. The main reason behind this could be related to the fact that correlation does not indicate to causality. Even though they are eliminated in this context, that only implies that they are insignificant when they are included as this set of variables. Furthermore, our trend lost its significance, however, it will be held in the model in the following steps. It may have lost its significance due to the fact that decreasing trend is demonstrated by many other independent variables in the model, so it is safer to hold it in at least one more step. Also the model has a high lag 1 value.
model5 <- lm(sales_data$UGS~trend+as.factor(Q)+ NUGV +NDGV, data = sales_data)
summary(model5)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NUGV + NDGV,
## data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67804 -17414 3897 19110 59734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.765e+06 4.825e+05 3.658 0.001467 **
## trend -6.596e+03 3.908e+03 -1.688 0.106254
## as.factor(Q)2 1.242e+05 1.898e+04 6.543 1.76e-06 ***
## as.factor(Q)3 2.798e+05 1.917e+04 14.592 1.83e-12 ***
## as.factor(Q)4 1.071e+05 1.979e+04 5.412 2.28e-05 ***
## NUGV -5.356e-01 1.590e-01 -3.369 0.002899 **
## NDGV 6.619e+03 1.421e+03 4.657 0.000135 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35420 on 21 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9572, Adjusted R-squared: 0.945
## F-statistic: 78.32 on 6 and 21 DF, p-value: 2.816e-13
checkresiduals(model5)##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 14.184, df = 10, p-value = 0.1647
When we eliminated the insignificant variables, the R squared value has decreased, so we will try to add each eliminated variables one by one to catch the one resulting in the best improvement.
model6 <- lm(sales_data$UGS~trend+as.factor(Q)+ NUGV + NDGV + PU, data = sales_data)
summary(model6)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NUGV + NDGV +
## PU, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60791 -13274 756 16273 55213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.869e+06 4.103e+05 4.554 0.000193 ***
## trend -4.941e+03 3.356e+03 -1.472 0.156576
## as.factor(Q)2 1.313e+05 1.626e+04 8.077 1.00e-07 ***
## as.factor(Q)3 2.890e+05 1.653e+04 17.484 1.38e-13 ***
## as.factor(Q)4 1.032e+05 1.682e+04 6.132 5.43e-06 ***
## NUGV -5.136e-01 1.349e-01 -3.807 0.001105 **
## NDGV 6.680e+03 1.205e+03 5.545 1.99e-05 ***
## PU -5.142e+02 1.692e+02 -3.039 0.006474 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30020 on 20 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.9707, Adjusted R-squared: 0.9605
## F-statistic: 94.79 on 7 and 20 DF, p-value: 6.1e-14
checkresiduals(model6)##
## Breusch-Godfrey test for serial correlation of order up to 11
##
## data: Residuals
## LM test = 20.192, df = 11, p-value = 0.04278
PU resulted in a good improvement in the model, so it is also included in the model. Lag 1 will tried to be dealt in the following steps.
model7 <- lm(sales_data$UGS~trend+as.factor(Q)+ NDGV + PU + NUGV + lag1, data = sales_data)
summary(model7)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU +
## NUGV + lag1, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44528 -10996 -4285 15377 39939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.796e+06 6.050e+05 4.621 0.000212 ***
## trend -8.559e+03 3.562e+03 -2.403 0.027253 *
## as.factor(Q)2 9.041e+04 2.366e+04 3.821 0.001252 **
## as.factor(Q)3 3.099e+05 1.729e+04 17.924 6.33e-13 ***
## as.factor(Q)4 1.985e+05 4.069e+04 4.878 0.000121 ***
## NDGV 9.724e+03 2.033e+03 4.782 0.000149 ***
## PU -6.308e+02 1.628e+02 -3.876 0.001108 **
## NUGV -7.551e-01 2.011e-01 -3.755 0.001450 **
## lag1 -4.865e-01 1.947e-01 -2.499 0.022362 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27090 on 18 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9679
## F-statistic: 99.11 on 8 and 18 DF, p-value: 2.71e-13
checkresiduals(model7)##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 20.476, df = 12, p-value = 0.05859
When lag 1 is included, all variables have significance which is a good sign for the validity of the model. However, these much variable pushed the lag 4 higher and adding lag 4 decreased R squared a lot. So we will try some alternative methods to search for a better conclusion.
sales_data$NUGVlag1 <- dplyr::lag(sales_data$NUGV, 1L)
sales_data$NDGVlag4 <- dplyr::lag(sales_data$NDGV, 4L)
sales_data$PUlag4 <- dplyr::lag(sales_data$PU, 4L)
sales_data$NLPGlag4 <- dplyr::lag(sales_data$NLPG, 4L)
sales_data$RNUVlag4 <- dplyr::lag(sales_data$RNUV, 4L)
sales_data$PUlag1 <- dplyr::lag(sales_data$PU, 1L)
model8 <- lm(sales_data$UGS~trend+as.factor(Q)+ NDGV + PU + lag1, data = sales_data)
summary(model8)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU +
## lag1, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74477 -15560 3068 24018 44993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.642e+05 2.721e+05 2.441 0.02463 *
## trend -1.508e+04 4.041e+03 -3.731 0.00141 **
## as.factor(Q)2 1.398e+05 2.557e+04 5.466 2.84e-05 ***
## as.factor(Q)3 2.991e+05 2.216e+04 13.498 3.47e-11 ***
## as.factor(Q)4 1.089e+05 4.284e+04 2.542 0.01990 *
## NDGV 2.624e+03 9.712e+02 2.701 0.01415 *
## PU -6.498e+02 2.115e+02 -3.073 0.00626 **
## lag1 -4.920e-02 2.028e-01 -0.243 0.81090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35210 on 19 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9604, Adjusted R-squared: 0.9458
## F-statistic: 65.85 on 7 and 19 DF, p-value: 5.513e-12
checkresiduals(model8)##
## Breusch-Godfrey test for serial correlation of order up to 11
##
## data: Residuals
## LM test = 12.914, df = 11, p-value = 0.299
model9 <- lm(sales_data$UGS~trend+as.factor(Q)+ NDGV + PU + NUGVlag1 +lag1, data = sales_data)
summary(model9)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU +
## NUGVlag1 + lag1, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46810 -15769 -4205 15933 45342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.462e+06 5.321e+05 4.627 0.000209 ***
## trend -1.183e+04 3.258e+03 -3.631 0.001911 **
## as.factor(Q)2 8.401e+04 2.497e+04 3.365 0.003449 **
## as.factor(Q)3 3.027e+05 1.722e+04 17.574 8.87e-13 ***
## as.factor(Q)4 1.834e+05 3.892e+04 4.712 0.000174 ***
## NDGV 6.726e+03 1.345e+03 4.999 9.30e-05 ***
## PU -6.872e+02 1.644e+02 -4.180 0.000563 ***
## NUGVlag1 -5.043e-01 1.370e-01 -3.681 0.001709 **
## lag1 -5.021e-01 1.998e-01 -2.514 0.021683 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27330 on 18 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9774, Adjusted R-squared: 0.9674
## F-statistic: 97.38 on 8 and 18 DF, p-value: 3.162e-13
checkresiduals(model9)##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 24.932, df = 12, p-value = 0.01515
Different lagged values of variables are tried for pulling autocorrelation below the upper limits. Lagged value of NUGV resulted in a good model, with all regressors being significant variables, relatively low lag4 value and more random looking error.
Evaluation of Final Model
Model 9 is chosen to be our final model with significant variables, good R squared value and errors. Below the plots to prove the validity is printed:
finalmodel <- model9
summary(model9)##
## Call:
## lm(formula = sales_data$UGS ~ trend + as.factor(Q) + NDGV + PU +
## NUGVlag1 + lag1, data = sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46810 -15769 -4205 15933 45342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.462e+06 5.321e+05 4.627 0.000209 ***
## trend -1.183e+04 3.258e+03 -3.631 0.001911 **
## as.factor(Q)2 8.401e+04 2.497e+04 3.365 0.003449 **
## as.factor(Q)3 3.027e+05 1.722e+04 17.574 8.87e-13 ***
## as.factor(Q)4 1.834e+05 3.892e+04 4.712 0.000174 ***
## NDGV 6.726e+03 1.345e+03 4.999 9.30e-05 ***
## PU -6.872e+02 1.644e+02 -4.180 0.000563 ***
## NUGVlag1 -5.043e-01 1.370e-01 -3.681 0.001709 **
## lag1 -5.021e-01 1.998e-01 -2.514 0.021683 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27330 on 18 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9774, Adjusted R-squared: 0.9674
## F-statistic: 97.38 on 8 and 18 DF, p-value: 3.162e-13
plot(model9)Residuals vs Fitted plot is almost a straight line, especially for the high values of UGS. Q-Q plot fits the line in most of the points which proves the normal distribution assumption of errors.
Forecasts of UGS Values 2007
First, the final model will predict the given dataset of first 28 Quarters, which will be represented on the plot below. Since they move together mostly and it gmade a good job catching the peaks and the trend, the model will be used for predicting the last 4 values.
final_plot <- sales_data
final_plot[,actual:=UGS]
final_plot[,predicted:=predict(finalmodel,final_plot)]
ggplot(final_plot ,aes(x=Quarter)) +
geom_line(aes(y=UGS,color='actual')) +
geom_line(aes(y=predicted,color='predicted'))+
labs(title = "Actual vs. Fitted UGS", x = "Quarter", y = "UGS (in 1000 m3)")## Warning: Removed 4 row(s) containing missing values (geom_path).
## Removed 4 row(s) containing missing values (geom_path).
Forecasts for the last 4 UGS values are given below:
prediction_set = sales_data[(29:32),c("UGS","NDGV","lag1","NUGVlag1","PU", "trend", "Q")]
for(i in 1:4) {
prediction_set[i,1] = predict(finalmodel,newdata = prediction_set[i,])
if(i<4){
prediction_set[i+1,"lag1"] = prediction_set[i,1]
}
}
prediction_set## UGS NDGV lag1 NUGVlag1 PU trend Q
## 1: 656222.1 342.1729 872000.0 5825866 565.19 29 1
## 2: 847051.5 346.9407 656222.1 5869018 565.19 30 2
## 3: 956963.8 351.4449 847051.5 5931348 565.19 31 3
## 4: 779731.7 357.2902 956963.8 5991280 565.19 32 4
Conclusion
In this homework, initially a dataset is analyzed whether it is stationary or not. Then the variables are inspected, several models tried and the one with the highest R squared and better in error distribution is chosen to predict last 4 values.